Abstract

We examine the problem of optimizing classification tree evaluation for on-line and real-time appli- 
cations by using GPUs. Looking at trees with continuous attributes often used in image segmentation, 
we first put the existing algorithms for serial and data-parallel evaluation on solid footings. We then 
introduce a speculative parallel algorithm designed for single instruction, multiple data (SIMD) architec- 
tures commonly found in GPUs. A theoretical analysis shows how the run times of data and speculative 
decompositions compare assuming independent processors. To compare the algorithms in the SIMD 
environment, we implement both on a CUDA 2.0 architecture machine and compare timings to a serial 
CPU implementation. Various optimizations and their effects are discussed, and results are given for all 
algorithms. Our specific tests show a speculative algorithm improves run time by 25% compared to a 
data decomposition. 

keywords: Classification Trees, Decision Tree Evaluation, Parallel Algorithms, GPU Computing, 
Speculative Decomposition, Optimization, Image Segmentation. 



1. Introduction 

Classification trees are used to solve problems in ar- 
eas as diverse as target marketing, fraud detection, 
pattern recognition, computer vision, and medical 
diagnosis. In many applications, classification trees 
are carefully designed once but then applied to many 
data sets to provide automated classifications. This 
approach is used to create validated classifiers for
tissue classification in mammography [12] and
intravascular ultrasound [11] diagnostic procedures. While
training the classifier is done offline, tree evaluation of 
each patient's data in these applications is an on-line 
algorithm where a user waits for a classification to be 
performed on many, many samples. Time spent wait- 
ing for this evaluation consumes valuable procedure 
room equipment and personnel. Performance require- 
ments only increase when single images are replaced 
by moving video for computer vision applications, as 
in robotic navigation [1]. In this environment, many
classifications are needed in real time to compute and
effect a timely response. Thus the need for
high-performance on-line evaluation of classification trees
ranges from beneficial to absolutely necessary. 

The assignment of a class to a given sample from 
a dataset requires that the sample be evaluated at 
each decision point along its path from the root of 
the tree to its eventual terminal leaf. While it may 
seem that each decision must be made in series for 
that sample, we note that each sample's classifica- 
tion is independent of all other samples. This allows 
us to decompose the problem of classifying all sam- 
ples in a dataset into the independent problems of 
classifying each sample, which can be done in par- 
allel. This decomposition according to sample data 
(a data decomposition approach) makes a growing 
number of parallel computing architectures available 
to speed up tree evaluation.

There is a good deal of literature on paralleliza- 
tion of training algorithms used to create classifica- 
tion trees [H [51 E HOI HH HH] in a traditional par- 
allel processing setting. Research on the tree eval- 
uation problem, however, seems to focus on Graph- 
ics Processing Units (GPUs) as the implementation 
platform. GPUs are typically designed specifically
for data parallel applications. As inexpensive, com- 
modity hardware found on every standard PC, GPUs 
match the cost, size, and power requirements of the 
on-line tree evaluation problem setting more closely 
than traditional supercomputers. Such application
of graphics hardware to generic problems has become 
known as General Purpose GPU (GPGPU) comput- 
ing. 

An early expedition into GPGPU techniques for 
machine learning can be found in [16], but application
to tree evaluation was first proposed by Sharp in [15].
His framework stores the tree as an array of nodes 
containing the decision criteria of that node and an 
index used to locate the next node. Subsequent node 
indices are computed without conditional branches 
to avoid their heavy performance penalties on most 
GPUs. The tree definition is passed to the GPU as 
a texture map used by a custom pixel shader. The 
shader consumes input feature data and combines it 
with the texture to produce a final value, the assigned 
class, for each pixel in parallel. Sharp extends this 
to evaluate random forests by concatenating multi- 
ple tree structures in the texture data and iterating 
over all trees. Results show a speedup of roughly two 
orders of magnitude over host-based algorithms. 

In [1], Baumstarck also uses a data parallel approach
on GPUs for a computer vision application,
available in [5]. The implementation is done directly
on the Compute Unified Device Architecture
(CUDA) platform offered by NVIDIA Corporation [3]
without using graphics libraries. Though conditionals
are used in the tree traversal, Baumstarck reports
a fifty-fold speedup of forest evaluation.

In this paper, we investigate a speculative ap- 
proach to tree evaluation on massively parallel GPU 
architectures, namely CUDA. Rather than treating 
the full evaluation of one sample as the atomic paral- 
lel task, we parallelize the evaluation of each node in 
the tree for a single sample then reduce the resulting 
path through the tree in parallel. This approach has 
some performance benefits on architectures where ex- 
ecution of parallel processors is not independent, as 
in SIMD machines. We compare this approach to the 
data decomposition used in previous work and to the 
best-known serial host algorithm, both of which we 
restate here so that all approaches are put on a solid 
footing. In the specific environment we examine,
results for speculative decomposition show a 25%
performance improvement over data decomposition. We
also see that host memory bandwidth and data
distribution are important measurement considerations
that can dominate the nuances of GPU performance
gains in typical PC systems, and must be accounted
for in any statement of speedup results.



2. Preliminaries 

2.1 Classification Trees 

In evaluating a classification tree, we are given a set 
of records, called the dataset, and a full binary deci- 
sion tree, called the classifier. Each record in the 
dataset contains several fields, called attributes or 
features. One of the attributes, the classifying at- 
tribute, indicates to which class the record belongs 
and is unknown. In the general case, attributes can 
be continuous, having (real) numerical values from 
an ordered domain, or categorical, representing values 
from an unordered set. The classifier is a predictive 
model created through a process known as training. 
In training, observations on a training set of records, 
each having a known classifying attribute, are used to 
build a tree such that each interior, or decision, node 
uses a single attribute value test to partition the set 
of records recursively until the subset of records at a 
given node have a uniform class. Such nodes are en- 
coded in the tree as leaf nodes. The evaluation of a 
dataset is complete when the trained classifier is used 
to determine to which leaf, and thereby which class, 
each record belongs. 

There are several training algorithms for exam- 
ining attributes and generating trees. The particu- 
lar algorithm used will not concern us here, so long 
as the resulting tree has the above properties. We 
examine trees where all attributes are continuous, a 
common occurrence in image segmentation. While 
we will look at real-valued attributes (approximated 
with floating point numbers), ordered discrete val- 
ues would behave very much the same. Categorical 
attributes, though, would likely require some modifi- 
cations to our approach. We will further assume that 
class values can be enumerated and put into one-to- 
one correspondence with the natural numbers. Evalu- 
ation will operate only on numbers, and any mapping 
to another representation for class values (e.g. to de- 
scriptive strings or pixel values) will be done outside 
the evaluation process. 

2.2 CUDA GPUs 

GPGPU computing has grown in popularity in recent 
years as a technique for improving performance for 
massively parallel applications, especially where
visualization and images are concerned. Initially, generic
parallel computing was achieved on GPUs by cleverly 
mapping the processing into the graphics domain us- 
ing libraries such as OpenGL to perform primitive 
tasks. As demand for customized graphics process- 
ing grew, vendors began supporting domain-specific 
programming languages like GL Shading Language 
(GLSL), making the GPU's floating point units more 
available. 

In recent years, GPGPU computing frameworks 
have made great strides in removing assumptions 
about the domain and providing a generic capabil- 
ity to be used in any application needing massive 
parallelization. Perhaps the leading such framework,
NVIDIA's CUDA architecture, can add tens or
hundreds of GigaFLOPs to a system's capability on a
single adapter card.

This power can be brought to bear on generic 
problems with great ease of use. The program- 
ming environments for these devices, whether vendor- 
specific or the industry standard OpenCL, can be 
used with no reference to the graphics domain. These 
environments subset the C/C++ programming
language and provide a set of keyword extensions to
manage the generation of both device-specific code 
and host code from the same source file set. In this 
way, code written to run on the GPU, called a kernel, 
is invoked with something that feels very akin to a C 
function call. 



2.2.1 CUDA Programming Model

The CUDA runtime executes kernels across many
threads, or individual streams of instructions (usually
for a single atomic parallel task), and manages
the mechanics of scheduling in hardware. Threads are
grouped into blocks as 1-, 2-, or 3-dimensional arrays
with each thread having a unique identifying index
in each dimension of the block. Further, blocks are
grouped into a 1- or 2-dimensional grid, with each
block again having an identifying index in the grid
dimensions. Each kernel invocation is done over a
single grid and gives the grid and block dimensions
to use when launched. Threads within a block are
allowed to synchronize and share memory, but no
communication between blocks is allowed. Threads
are scheduled and executed in 32-thread units called
warps, with some operations happening on a half-warp,
or 16 threads. All threads have access to their
local memory (registers and stack), the shared memory
of their block, and a global memory common to
the entire device. The host can read from and write
data to global memory but not local or shared memory.
The host is required to copy kernel input and
output data to and from device global memory outside
of the kernel execution.

A simple example helps to illustrate a typical kernel
invocation. First, the host CPU copies the input
data to the GPU device's global memory. Since
the host and device address spaces are separate, the
CUDA runtime provides the host with APIs to allocate
storage in device space, copy memory between
spaces, look up device space symbol addresses, etc.
The host must also allocate device global memory to
store the results of the computation. The host can
then invoke the CUDA runtime to launch the kernel
with certain grid and block dimensions. Arguments
such as the input and output buffers in device space
are passed in the invocation. The device allocates
execution resources to the kernel grid and schedules
threads to execute in warps. Each thread uses its
block and thread indices to identify its associated
portions of input and output data. It can then do
thread-specific memory transfer to its own stack and
registers. Once the input data is locally available,
computation is done and output is stored in device
global memory. When all threads have completed,
the host is signaled and is then free to copy the results
from device to host memory and deallocate buffers.
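The indexing arithmetic described in this section can be sketched in a few lines. The Python below is only a serial, host-side model of a one-dimensional grid of one-dimensional blocks; it is not CUDA code, and the names `global_index`, `warp_of`, `launch`, and `scale_kernel` are hypothetical helpers introduced for illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def global_index(block_idx, block_dim, thread_idx):
    """Global element index for a thread in a 1-D grid of 1-D blocks."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx):
    """Warp number of a thread within its block."""
    return thread_idx // WARP_SIZE


def launch(kernel, grid_dim, block_dim, *args):
    """Serially emulate a launch over grid_dim blocks of block_dim threads."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def scale_kernel(block_idx, block_dim, thread_idx, src, dst, factor):
    """Toy kernel: each thread scales the single element it owns."""
    i = global_index(block_idx, block_dim, thread_idx)
    if i < len(src):  # guard threads that fall past the end of the data
        dst[i] = src[i] * factor


def demo():
    src = list(range(100))          # stands in for the input buffer
    dst = [0] * len(src)            # stands in for the output buffer
    block_dim = 64
    grid_dim = (len(src) + block_dim - 1) // block_dim  # ceiling division
    launch(scale_kernel, grid_dim, block_dim, src, dst, 2)
    return dst, grid_dim
```

The bounds guard in `scale_kernel` reflects a common convention when the data size is not a multiple of the block size; whether the implementation in this paper pads its data or guards its threads is not stated.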

1 NVIDIA represents that CUDA is an extension to ANSI C,



2.2.2 CUDA Hardware Architecture 

While an extensive discussion of CUDA architecture 
is beyond the scope of this paper, some of the algorithm
designs given herein are driven by certain qualities
which bear discussion. The fundamental execution
units of a CUDA device, called stream processors
and known as cores, are arranged in N-way SIMD
groups for some implementation-dependent N (usually
8, 32, or 48). These groups are combined with special
function units (SFUs), instruction cache/decode
logic, a register file, L1 cache/shared memory, (usually
2) warp schedulers, and a network interconnect
to form a streaming multiprocessor, or SM (Figure 1).
All threads in a block will be executed on the
same SM, scheduled very efficiently by the hardware
warp schedulers. When a warp is scheduled, all
threads in that warp execute the same instruction,
but have their own registers and stack. When some
threads take conditional branches different from other
threads, the warp executes the two paths in series until
the paths merge. This is known as a divergent path,
and can affect the kernel's performance substantially.

1 (continued) but recent versions also allow for the use of classes.

When a warp encounters a long-latency instruc- 
tion (such as global memory access), it can be 
swapped for another warp in a small number of 
clocks. There is a limit to this capability, however, 
and the SM can only have so many blocks and threads 
resident at a time. This concept is known as occu- 
pancy, and can also affect the kernel's performance. 
Low occupancy means an SM has nothing to do dur- 
ing long latency instructions, so the SM is not fully 
utilized. 
Figure 1: Streaming Multiprocessor detail (NVIDIA
Corporation)

Finally, accessing global memory from an SM is 
an expensive operation, typically 100 times the cost 
of accessing local memory. In some CUDA implemen- 
tations, accesses to global memory that meet certain 
requirements (such as contiguous access of 32, 64, or 
128 bytes made in order by each core) can be coalesced 



into a single read, improving throughput. Later
versions of CUDA hardware add L1 and even L2 cache,
which mitigates the cost of non-coalesced reads.

See [31 HI IHJ H3] for a more complete and detailed 
overview of the CUDA architecture. 



3. Classification Tree Algorithms 

It is natural to imagine an algorithm for evaluating a 
decision tree using a binary tree data structure and 
a depth-first traversal which, at each node, uses a 
conditional to evaluate whether the traversal should 
follow the left or right child of the node. Condi- 
tional statements, however, present problems for tra- 
ditional CPUs (in the form of branch misprediction 
and pipeline flush) and GPUs (in the form of
serialized divergent paths for SIMD warp execution).
Sharp avoids this problem in [15] by developing a
branchless tree traversal, which we will adopt for the 
base serial evaluation algorithm. A host implemen- 
tation of this algorithm, as the best known serial al- 
gorithm, will be the reference by which speedup of 
parallel algorithms is determined. 

3.1 Branchless Tree Evaluation 

The evaluation problem can be stated as follows:
given a dataset $\mathcal{D} = \{R : R = [r_1, \ldots, r_A],\ r_a \in \mathbb{R}\}$
with $|\mathcal{D}| = M$ and a full binary classification tree $\tau$
with a set of nodes $\mathcal{N} = \{n : n = (a_n, t_n, d_n^l, d_n^r, c_n)\}$
where:

• $|\mathcal{N}| = N$ is the number of nodes in $\tau$

• $1 \le a_n \le A$ is the index of attribute $r_{a_n}$ in each
record R to be evaluated by node n

• $t_n \in \mathbb{R}$ is the threshold for attribute $r_{a_n}$ used
by node n

• $d_n^l \in \{\mathcal{N} \cup \emptyset\}$ is n's left descendant and
recursively evaluates R when $r_{a_n} \le t_n$

• $d_n^r \in \{\mathcal{N} \cup \emptyset\}$ is n's right descendant and
recursively evaluates R when $r_{a_n} > t_n$

• $c_n \in \{C \cup \bot : C \subset \mathbb{N}$ is the set of possible class
values$\}$ is $\bot$ when $(d_n^l \ne \emptyset \wedge d_n^r \ne \emptyset)$ or some
$c \in C$ when $(d_n^l = \emptyset \wedge d_n^r = \emptyset)$

and having a root node $n_0$, assign to each $R \in \mathcal{D}$ a
$c_R \in C$ by recursively evaluating R beginning at $n_0$.
Procedure 1 Breadth-first Encoding of Tree

Procedure 2 Serial Tree Evaluation

Parameter: D
Parameter: breadthFirstTree[N]
Output: assignedClasses[|D|]
for all R ∈ D do
    i = 0
    while breadthFirstTree[i].classVal = ⊥ do
        a = breadthFirstTree[i].attributeIndex
        t = breadthFirstTree[i].threshold
        i = breadthFirstTree[i].childIndex + (r_a > t)
    c_R = breadthFirstTree[i].classVal
    assignedClasses[R] = c_R



To evaluate $\tau$ without branching, we first encode
$\mathcal{N}$ in a breadth-first array of nodes. Procedure 1
shows how each node is assigned an index i in the
array breadthFirstTree to create a data structure
describing the tree. Note that every right child has
an index that is one more than the neighboring left
child. Each node, then, need only store the index of
its left child. To compute the index of the next node
to evaluate, the node compares its attribute value $r_{a_n}$
against its threshold $t_n$ using the Boolean predicate
"greater-than." If the result is false and encoded as 0,
adding the result to the node's child index will yield
the index of its left child, as desired. If the result is
true and encoded as 1, adding it to the child index will
yield the node's right child's index. While not strictly
branchless due to the while loop, this technique does
avoid any explicit conditional to compute the path to
take at each decision node. The general algorithm is
shown in Procedure 2.
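As a minimal sketch of the encoding and traversal just described, the Python below flattens a small tree into a breadth-first list and then evaluates records by adding the comparison result to the stored child index. The body of Procedure 1 is not reproduced in this text, so `encode_breadth_first` is only one plausible reading of it. The nested-tuple input format, the dictionary field names, and the use of `None` for the "no class" marker are assumptions made for illustration, and attribute indices are 0-based here rather than 1-based.

```python
from collections import deque

BOTTOM = None  # stands in for the "no class" marker on decision nodes


def encode_breadth_first(root):
    """Flatten a full binary tree into a list in breadth-first order.

    A decision node is a tuple (attribute_index, threshold, left, right);
    a leaf is a bare integer class value. Siblings land in adjacent
    slots, so each node stores only the index of its left child.
    """
    nodes = [None]                  # slot 0 is reserved for the root
    queue = deque([(root, 0)])
    while queue:
        node, slot = queue.popleft()
        if isinstance(node, int):   # leaf: carries a class value
            nodes[slot] = {"attr": 0, "thr": 0.0, "child": slot, "cls": node}
        else:
            attr, thr, left, right = node
            child = len(nodes)              # slot for the left child
            nodes.extend([None, None])      # right child sits at child + 1
            nodes[slot] = {"attr": attr, "thr": thr,
                           "child": child, "cls": BOTTOM}
            queue.append((left, child))
            queue.append((right, child + 1))
    return nodes


def evaluate(nodes, record):
    """Follow the path from the root; the comparison yields 0 (left) or 1 (right)."""
    i = 0
    while nodes[i]["cls"] is BOTTOM:
        n = nodes[i]
        i = n["child"] + int(record[n["attr"]] > n["thr"])
    return nodes[i]["cls"]


# Example tree: the root tests attribute 0, its right child tests attribute 1.
EXAMPLE_TREE = (0, 0.5, 1, (1, 2.0, 2, 3))
```

For instance, `evaluate(encode_breadth_first(EXAMPLE_TREE), [0.7, 3.0])` follows the right branch twice and returns class 3.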

3.2 Data Decomposition 

Procedure 2 is parallelized by data decomposition almost
trivially, since each record is independent of
the others. We can simply assign m records to each of
the processors and have each loop only over its m
records. The only additional work is to map the m records to
the global dataset for the purposes of indexing into
the input and output arrays. Procedure 3 shows
the algorithm for each processor with indexing details
for parameters $\mathcal{D}$ and assignedClasses. We use
$\mathcal{D}[s..t)$ to mean the subset of elements of $\mathcal{D}$
beginning at element s up to but not including element
t. Here, we assume a shared memory architecture
so that all processors have equal access to the
parameter and output buffers. Knowing the index to a
record R in $\mathcal{D}$ also gives the index to the
corresponding assignedClasses value. The steps of making
$\mathcal{D}$, breadthFirstTree, and assignedClasses
available to each processor are omitted.

[13] uses a data parallel approach similar to this,
as does [1] when evaluating boosted decision trees
using CUDA, though the latter uses conditional
instructions to traverse the tree.



Procedure 3 Data-Parallel Tree Evaluation

Parameter: D
Parameter: breadthFirstTree[N]
Parameter: m ∈ ℕ, the number of records for this processor to process
Parameter: p ∈ ℕ, this processor's rank
Output: assignedClasses[|D|]
for all R ∈ D[m·p .. m(p + 1)) do
    i = 0
    while breadthFirstTree[i].classVal = ⊥ do
        a = breadthFirstTree[i].attributeIndex
        t = breadthFirstTree[i].threshold
        i = breadthFirstTree[i].childIndex + (r_a > t)
    c_R = breadthFirstTree[i].classVal
    assignedClasses[R] = c_R
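The partitioning in Procedure 3 can be illustrated with a short Python sketch in which each worker handles the half-open index range [m·p, m(p + 1)). It uses a thread pool purely to show the decomposition; CPython threads do not provide the parallel speedup discussed in this paper. The hand-encoded `TREE`, the tuple layout, and the function names are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

BOTTOM = None  # "no class" marker for decision nodes

# Breadth-first tree: (attribute_index, threshold, child_index, class_value)
TREE = [
    (0, 0.5, 1, BOTTOM),  # 0: root, tests attribute 0
    (0, 0.0, 1, 1),       # 1: leaf, class 1
    (1, 2.0, 3, BOTTOM),  # 2: decision node, tests attribute 1
    (0, 0.0, 3, 2),       # 3: leaf, class 2
    (0, 0.0, 4, 3),       # 4: leaf, class 3
]


def classify(record):
    """Serial evaluation of one record, as in Procedure 2."""
    i = 0
    while TREE[i][3] is BOTTOM:
        attr, thr, child, _ = TREE[i]
        i = child + int(record[attr] > thr)
    return TREE[i][3]


def worker(dataset, assigned, m, p):
    """Processor p classifies records in the range [m*p, m*(p+1))."""
    for k in range(m * p, min(m * (p + 1), len(dataset))):
        assigned[k] = classify(dataset[k])  # record index == output index


def evaluate_data_parallel(dataset, num_procs):
    m = -(-len(dataset) // num_procs)       # ceil(M / P) records per processor
    assigned = [None] * len(dataset)        # shared output buffer
    with ThreadPoolExecutor(max_workers=num_procs) as pool:
        futures = [pool.submit(worker, dataset, assigned, m, p)
                   for p in range(num_procs)]
        for f in futures:
            f.result()                      # re-raise any worker exception
    return assigned
```

Because each worker writes only to its own slice of `assigned`, no locking is needed, which mirrors the independence of records noted in the text.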



3.3 Speculative Decomposition 

While a data decomposition applies multiple proces- 
sors to the evaluation problem very efficiently, the 
task of evaluating a single tree is still done serially. 
Once m is reduced to 1, no further processors can
be applied to the problem usefully. Also, very deep
and unbalanced trees may lead to asymmetries in
the runtime between processors. In image segmentation,
for instance, neighboring samples are expected
to take similar paths through the tree and have al- 
most uniform class values. By luck of the draw, 
some processor may be assigned m records that hap- 
pen to be classified by the deepest node in the tree 
while others have records classified at the top of the 
tree. This leads to idle time in the "lucky" processors, 
and thereby, practical inefficiency. Further, adjacent 
records taking different paths leads to similar inef- 
ficiencies in SIMD architectures like CUDA SMs or 
Intel's SSE instruction set. 

We propose a speculative decomposition of the
problem to avoid the issues of divergent paths,
irregular memory access patterns, and idle time due to
asymmetrical processing times, and to provide the more
uniform evaluation times needed in deterministic,
real-time applications. We assign to each record a group of
p processors, called a record group, such that p = N.
If there are G such groups, the total number of
processors becomes P = Gp. Within the group, each
node n of the tree is assigned to processor $p_n$. The
first step of the algorithm is to evaluate all nodes in 
parallel. Each processor stores the child node index 
i determined by the node evaluation into a shared 
memory array, path, having one element for each pro- 
cessor. The second step is to reduce the path through 
the tree to the selected leaf node. This is done by 
having each processor copy the path value of its child 
node into its own element of path. That is, each node 
finds its successor's successor and adopts that as its 
own successor. We can then think of the path ar- 
ray as storing the eventual successor for each node, 
with the eventual successor of the root node being 
the terminal node for the record. This step must be 
done synchronously across all processors in the record 
group. Leaf nodes are specifically designed to always
evaluate to themselves by setting their threshold to
$+\infty$, so that the greater-than test is always false,
and their child index to be their own index.

Figure 2 shows an example tree and the path array
after the initial node evaluation (2b), then after
one (2c) and two (2d) steps of the parallel reduction
phase. Note that for a tree of depth d, only $O(\log_2 d)$
reduction steps are necessary for the root node to
arrive at the terminal leaf's index. When this occurs,
the reduction terminates.

Procedure 4 gives the parallel algorithm, which
handles indexing the dataset as before but now
accounts for the specific record group g in the
calculation, as well as determining which node of the
tree each processor is assigned to and setting up the
shared variable path. To compute the dataset indices,
we can follow the form of Procedure 3 but substitute
g for p. Again, we assume a shared arrangement for
the input dataset and the output assignedClasses
where the indices in each array correspond naturally.
We use the primitive barrier() to provide
synchronization on updates to path from within record group g.



Procedure 4 Speculative Parallel Tree Evaluation
 1: Parameter: D
 2: Parameter: breadthFirstTree[N]
 3: Parameter: m ∈ ℕ, the number of records for this record group to process
 4: Parameter: g ∈ ℕ, the record group this processor belongs to
 5: Parameter: p_n ∈ ℕ, this processor's rank in the record group
 6: Output: assignedClasses[|D|]
 7: Shared Variable: path[N]
 8: for all R ∈ D[m·g .. m(g + 1)) do
 9:     a = breadthFirstTree[p_n].attributeIndex
10:     t = breadthFirstTree[p_n].threshold
11:     path[p_n] = breadthFirstTree[p_n].childIndex + (r_a > t)
12:     barrier(g)
13:     rootClass = breadthFirstTree[path[0]].classVal
14:     while rootClass = ⊥ do
15:         path[p_n] = path[path[p_n]]
16:         barrier(g)
17:         rootClass = breadthFirstTree[path[0]].classVal
18:     c_R = rootClass
19:     assignedClasses[R] = c_R
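The two phases of Procedure 4 can be sketched for a single record as follows. This Python is a serial model only: list comprehensions stand in for the N processors of a record group, and building a fresh list on each pass models the barrier, since all reads complete before any write becomes visible. The example `TREE`, its tuple layout, and the choice of a +∞ leaf threshold (so a leaf's greater-than test is always false and it evaluates to its own index) are assumptions of this sketch.

```python
BOTTOM = None           # "no class" marker for decision nodes
INF = float("inf")

# Breadth-first tree: (attribute_index, threshold, child_index, class_value).
# Leaves store their own index as child_index and a threshold of +inf.
TREE = [
    (0, 0.5, 1, BOTTOM),  # 0: root
    (0, INF, 1, 1),       # 1: leaf, class 1
    (1, 2.0, 3, BOTTOM),  # 2: decision node
    (0, INF, 3, 2),       # 3: leaf, class 2
    (2, 1.0, 5, BOTTOM),  # 4: decision node
    (0, INF, 5, 3),       # 5: leaf, class 3
    (0, INF, 6, 4),       # 6: leaf, class 4
]


def evaluate_speculative(tree, record):
    """Return (class, number of reduction passes) for one record."""
    n = len(tree)
    # Phase 1: every node is evaluated, whether or not it lies on the
    # record's actual path through the tree.
    path = [tree[p][2] + int(record[tree[p][0]] > tree[p][1])
            for p in range(n)]
    passes = 0
    # Phase 2: each node adopts its successor's successor until the
    # root's entry names a leaf.
    while tree[path[0]][3] is BOTTOM:
        path = [path[path[p]] for p in range(n)]
        passes += 1
    return tree[path[0]][3], passes
```

With the record `[0.7, 3.0, 2.0]` the true path is 0, 2, 4, 6 (three edges), and the root reaches leaf 6 after two passes, consistent with the logarithmic bound on reduction steps.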



3.4 Improved Speculative Decomposition 

A few inefficiencies exist in Procedure 4. First,
processors assigned to leaf nodes will always produce the
same, known output, and so do no productive work.
To avoid this waste, the path
array can be initialized with the known, static results
for all leaves. Processors will only be assigned to
decision nodes such that $0 \le p_n < (N - 1)/2$, the number
of internal nodes in a full binary tree. This means,
however, that mapping processors in a record group
to tree nodes is no longer a simple, sequential
operation. A tree-specific look-up table can accommodate
this. As the record group processes, each processor
will modify only the element of path it is assigned to.

Figure 2: Parallel Tree Path Reduction. (a) Example Tree



Second, if the tree reduction is viewed probabilistically,
we see that most records will end up at some
leaf between levels 1 and d of the tree, averaging to
some $d_\mu$ for the dataset. Checking the while condition
on line 14 of Procedure 4 for all levels $d_r < d_\mu$ leads
to an expected inefficiency. If $d_\mu$ is known or can
be determined experimentally for the tree, reducing
$d_\mu$ levels in a single while loop pass can provide an
average case performance enhancement by reducing
loop iterations and the number of barrier operations
required.



Procedure 5 gives the improved parallel algorithm
for speculative decomposition. We add the static
paths for the leaves of the tree on line 3 and use them to
initialize the path array in parallel on line 10. Each
processor must now initialize two elements of path
since there are only processors for the internal nodes.
We also add the processor-node map on line 4, which
records the node index i assigned to each processor.
Line 20 shows the concept of multiple reductions per
loop, though the optimal implementation will be
tree-specific.



Procedure 5 Speculative Parallel Tree Evaluation
 1: Parameter: D
 2: Parameter: breadthFirstTree[N]
 3: Parameter: leafPaths[N]
 4: Parameter: processorNodeMap[(N − 1)/2]
 5: Parameter: m ∈ ℕ, the number of records for this record group to process
 6: Parameter: g ∈ ℕ, the record group this processor belongs to
 7: Parameter: p_n ∈ ℕ, this processor's rank in the record group
 8: Output: assignedClasses[|D|]
 9: Shared Variable: path[N]
10: path[2p_n] = leafPaths[2p_n]
11: path[2p_n + 1] = leafPaths[2p_n + 1]
12: i = processorNodeMap[p_n]
13: for all R ∈ D[m·g .. m(g + 1)) do
14:     a = breadthFirstTree[i].attributeIndex
15:     t = breadthFirstTree[i].threshold
16:     path[i] = breadthFirstTree[i].childIndex + (r_a > t)
17:     barrier(g)
18:     rootClass = breadthFirstTree[path[0]].classVal
19:     while rootClass = ⊥ do
20:         path[i] = path[path[path[i]]]
21:         barrier(g)
22:         rootClass = breadthFirstTree[path[0]].classVal
23:     c_R = rootClass
24:     assignedClasses[R] = c_R
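A rough serial model of Procedure 5 for a single record is given below. It builds the two look-up tables from a breadth-first tree, evaluates decision nodes only, and applies the composed update of line 20 once per loop pass. The table contents are assumptions: the text does not say what leafPaths holds for decision-node entries, so this sketch uses -1 as a placeholder there, and it copies the whole table at once instead of two elements per processor.

```python
BOTTOM = None  # "no class" marker for decision nodes

# Breadth-first tree: (attribute_index, threshold, child_index, class_value).
# Leaves are never evaluated here, so their thresholds are unused.
TREE = [
    (0, 0.5, 1, BOTTOM),  # 0: root
    (0, 0.0, 1, 1),       # 1: leaf, class 1
    (1, 2.0, 3, BOTTOM),  # 2: decision node
    (0, 0.0, 3, 2),       # 3: leaf, class 2
    (2, 1.0, 5, BOTTOM),  # 4: decision node
    (0, 0.0, 5, 3),       # 5: leaf, class 3
    (0, 0.0, 6, 4),       # 6: leaf, class 4
]


def build_tables(tree):
    """Static leaf results and the processor-to-decision-node map."""
    leaf_paths = [i if node[3] is not BOTTOM else -1   # -1: placeholder
                  for i, node in enumerate(tree)]
    processor_node_map = [i for i, node in enumerate(tree)
                          if node[3] is BOTTOM]
    return leaf_paths, processor_node_map


def evaluate_improved(tree, record, leaf_paths, processor_node_map):
    """Return (class, number of loop passes) for one record."""
    path = list(leaf_paths)            # leaves start at their known result
    for i in processor_node_map:       # one processor per decision node
        attr, thr, child, _ = tree[i]
        path[i] = child + int(record[attr] > thr)
    passes = 0
    while tree[path[0]][3] is BOTTOM:
        snapshot = list(path)          # models the barrier: read, then write
        for i in processor_node_map:
            path[i] = snapshot[snapshot[snapshot[i]]]
        passes += 1
    return tree[path[0]][3], passes
```

For the three-edge path taken by `[0.7, 3.0, 2.0]`, the root reaches its leaf in a single pass here, compared with two passes under the single-jump update of Procedure 4.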
3.5 Management and Tuning of Parallel Algorithms

Some management work is required for each algorithm
in Sections 3.2, 3.3, and 3.4 but is omitted
for brevity and to preserve generality. This
includes making the buffers for $\mathcal{D}$, assignedClasses,
breadthFirstTree, and any of the other necessary
symbols available to all the parallel processors for
each algorithm. The mechanism for sharing these
buffers depends on the programming environment
used. Also, selection of optimal values for G and
m given P, N, M, and the available execution hardware
architecture is critical but entirely implementation
dependent.

3.6 Analysis of Evaluation Algorithms 

We now analyze the asymptotic behavior of these general
algorithms assuming a traditional parallel processing
setting of independent processors connected
via shared memory. We perform an average case run
time analysis by letting $d_\mu$ be the average depth
of the tree traversed by the records in the dataset.
This can be determined if the entire dataset is known
a priori, or can be statistically estimated given a
significant sample size, such as the training set. The
serial runtime for Procedure 2 for M records is given
by

$$T_2 = M d_\mu (t_e + t_c)$$

where $t_e$ is the time to evaluate a node's attribute
against its threshold and $t_c$ is the time to compare
the new node's class value to $\bot$. We also refer to
$t_n = t_e + t_c$ as the time needed to evaluate a node.

The run time for Procedure 3 is a function of P,
the total number of processors applied, and is given
by

$$T_3(P) = \frac{M}{P} d_\mu (t_e + t_c) + t_i + t_s(M)$$

where each processor classifies $M/P$ records, $t_i$ is the
time needed to compute the index in $\mathcal{D}$ assigned to
each processor, and $t_s(M)$ is the time needed to
transmit M records on the shared memory machine
for processing. We can then examine the speedup of
Procedure 3 as

$$S_3(P) = \frac{T_2}{T_3(P)} = \frac{M d_\mu (t_e + t_c)}{\frac{M}{P} d_\mu (t_e + t_c) + t_i + t_s(M)} = \frac{P}{1 + \frac{P (t_i + t_s(M))}{M d_\mu (t_e + t_c)}}$$



If we assume $t_s(M) = \sigma M + \gamma$ for some $\sigma$, $\gamma$ and we
ignore $\gamma$ and $t_i$ as small constants, then this simplifies
asymptotically to

$$S_3(P) = \frac{P}{1 + \frac{P\sigma}{d_\mu t_n}}$$

which suggests the speedup will be decided by the
relative performance of the memory copy and the serial
node processing time. If they are very similar, we
would not expect much speedup. If memory copies
are very fast compared to node processing, some benefit
may be had. Likewise for the efficiency, given by

$$E_3(P) = \frac{S_3(P)}{P} = \frac{1}{1 + \frac{P\sigma}{d_\mu t_n}}$$

we expect good results only when copy time is much
less than processing time.

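For Procedure 5 the analysis is a bit more involved.
If each group of processors is assigned $m = M/G$
records for G groups of p processors such that
P = Gp, the parallel runtime is given by

$$T_5(P) = \frac{Mp}{P}\left(t_e + (\log_2 d_\mu)\, t_c\right) + t_i + t_s(M)$$

and the speedup is

$$S_5(P) = \frac{T_2}{T_5(P)} = \frac{M d_\mu (t_e + t_c)}{\frac{Mp}{P}\left(t_e + (\log_2 d_\mu)\, t_c\right) + t_i + t_s(M)} = \frac{P}{\frac{p\left(t_e + (\log_2 d_\mu)\, t_c\right)}{d_\mu (t_e + t_c)} + \frac{P (t_i + t_s(M))}{M d_\mu (t_e + t_c)}}$$

Making the same assumptions about $t_s(M)$, $t_i$, and
$\gamma$, $S_5(P)$ simplifies asymptotically to

$$S_5(P) = \frac{P}{\frac{p\left(t_e + (\log_2 d_\mu)\, t_c\right)}{d_\mu (t_e + t_c)} + \frac{P\sigma}{d_\mu t_n}}$$

with efficiency

$$E_5(P) = \frac{S_5(P)}{P} = \frac{1}{\frac{p\left(t_e + (\log_2 d_\mu)\, t_c\right)}{d_\mu (t_e + t_c)} + \frac{P\sigma}{d_\mu t_n}}$$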



For the values of P and $d_\mu$ we examine, this should
not be very different from $S_3(P)$. However, these
equations allow us to examine when $S_5(P) > S_3(P)$,
which occurs when

$$\begin{aligned}
\frac{p\left(t_e + (\log_2 d_\mu)\, t_c\right)}{d_\mu (t_e + t_c)} &< 1 \\
p\left(t_e + (\log_2 d_\mu)\, t_c\right) &< d_\mu (t_e + t_c) \\
p &< \frac{d_\mu (t_e + t_c)}{t_e + (\log_2 d_\mu)\, t_c}
\end{aligned}$$

If we further assume $t_e$ and $t_c$ are roughly equivalent
operations (both being comparisons) and each taking
time t, we can simplify this to

$$\begin{aligned}
p &< \frac{2 t d_\mu}{t (1 + \log_2 d_\mu)} \\
p &< \frac{2 d_\mu}{1 + \log_2 d_\mu} \qquad (1)
\end{aligned}$$

For practical values of $d_\mu$, the slope of the graph of
(1) is around 1/3. Since the number of decision nodes
grows faster than the average depth (at a rate dependent
on the balancing of the tree), we should not
expect a great speedup from Procedure 5 for any but
the most shallow trees.
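To get a feel for how restrictive this bound is, the following Python sketch evaluates both the general threshold on p and the equal-cost form of inequality (1). The function names and the default unit costs are assumptions chosen for illustration; they are not measurements from this paper.

```python
import math


def max_group_size(d_mu, t_e=1.0, t_c=1.0):
    """Bound on record-group size p below which S_5(P) > S_3(P)."""
    return d_mu * (t_e + t_c) / (t_e + math.log2(d_mu) * t_c)


def bound_equal_costs(d_mu):
    """Inequality (1): the same bound when t_e and t_c are equal."""
    return 2 * d_mu / (1 + math.log2(d_mu))


def average_slope(d_lo, d_hi):
    """Average slope of the bound in (1) between two average depths."""
    return (bound_equal_costs(d_hi) - bound_equal_costs(d_lo)) / (d_hi - d_lo)
```

For example, `bound_equal_costs(8)` gives 4.0 and `bound_equal_costs(16)` gives 6.4, an average slope of 0.3 over that interval, in line with the figure of roughly 1/3 quoted above.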



4. Experiments on Parallel Classification Tree Algorithms

The preceding analysis assumes that each parallel node
execution is independent of the others. On GPUs, and
on the CUDA architecture in particular, this is not the
case. We expect to see a performance benefit from
local caching of neighboring records read from global
memory in bursts, from the SIMD coupling of the nodes
evaluated in parallel for each sample, from having
multiple SIMD groups resident on the chosen hardware
and quickly switched between, and from similar effects.
These concerns are not general but are specific to a
particular hardware architecture. In this setting, it
makes sense to pursue a more specific analysis by
experimentation. The following sections detail
experiments done on the CUDA platform with runtime as
the performance metric.

4.1 Problem Selection 

We selected the Image Segmentation dataset from
UC Irvine's Machine Learning Repository [17] as an
evaluation problem representative of tasks in medical
imaging or computer vision. This data set consists
of 2310 records for training and an additional 2099
for testing. Each record consists of 19 real-valued
attributes of a 3 x 3 pixel neighborhood and
corresponds to one of 7 discrete classes.

To generate a classifier based on this dataset,
we used the Orange component-based machine learning
library available from [9]. This library provides
Python bindings to a mature C++ machine learning
library. We wrote a Python script to read the training
set, train a classification tree, and generate C++
source code which encodes that tree according to
Procedure [1]. The resulting tree is shown schematically
in Figure [3]. This tree has N = 31 nodes, 16 leaves,
and a depth of 11.

Further, the script also combined the training set 
and the test set of records into a single table, then re- 
peatedly randomized and output the records as C++ 
source code for easy inclusion in our test program. 
This process was repeated until 16,384 C++ records 
were generated. This set can be duplicated four times 
at runtime to create a dataset having 65,536 records, 
representing an image of 256 x 256 pixels. 

4.2 Experiment Setup 

4.2.1 Machine Configuration 

Experiments were performed on a Dell Optiplex 780
with an Intel Core2 Duo E8600 CPU running at 3.33
GHz, 4 GB RAM, and the Windows 7 64-bit operating
system. An NVIDIA Quadro 2000 GPU card was added,
with 1 GB of RAM on a 128-bit interface, a memory
bandwidth of 41.6 GB/s, and 192 CUDA cores in 4 SMs
of 48 cores each at a 1.25 GHz processor clock.
Software on the system included the NVIDIA driver
version 263.06 and the CUDA 3.2.1 runtime DLL
version 8.17.12.6303. All compilation was done with
Microsoft Visual Studio 2008 and the CUDA 3.2
Development Toolkit, with project files generated by
CMake version 2.8.3.

4.2.2 Tests Conducted 

We created a program which, after building a dataset
of 65,536 records, ran three tree evaluation functions
500 times each on the full dataset. For each function
call, the Windows high performance counter was
started before and stopped after the call, and the
elapsed time was accumulated. This is called the outer
time for the algorithm. For those functions using a
CUDA kernel, a similar inner time was collected around
just the kernel invocation, excluding any time for
memory copies to or from the GPU. While the kernel
ran, the host CPU was made to wait until the kernel
completed. The three functions evaluated were as
follows:

EvalTree(): This function implements Procedure [2],
a serial algorithm running on the host. Note
that this function records no inner time and
that the outer time does not include any memory
copies since none are required for the host
to evaluate the dataset.

EvalTreeBySample(): This is the data parallel
algorithm given in Procedure [3]. This function is
written in CUDA C, and performs a host-to-device
copy of the dataset and the tree definition
before invoking the kernel. The grid is
formed of 512 blocks having 128 threads each,
all single-dimensioned. Only one record is
evaluated per thread (i.e. m = 1). For this
function (and all other CUDA functions), the tree is
copied to device constant memory for caching
purposes. When the kernel completes, the host
copies the resulting class assignments back to
host memory and frees all device resources.

EvalTreeByNode(): This function fully implements
the improved speculative algorithm corresponding
to Procedure [5] with the following
considerations: constant memory is used for the
processor-node map and static leaf path buffers
in addition to the tree definition; multiple
reductions (specifically 2, determined
empirically) are performed per iteration of the path
reduction loop; and the explicit barrier()
operations are omitted since each thread executes
synchronously within a warp. The shared memory
path variable is initialized from the static
leaf buffer only once at kernel invocation. This
is safe since leaves never change and internal
nodes are re-initialized by the node evaluation
step done for each record. The grid is set to 128
blocks of 16 x 16 threads. Thus each block
processes 16 record groups in parallel, each record
group using p = 16 threads (a half-warp) to
evaluate a record. Note that there are only 15
internal nodes in the tree, so one thread is idle
per record group (assigned to a phantom node).
With 128 x 16 = 2048 record groups, each group
must process m = 32 records to cover 65,536
records exactly. Having the thread geometry
exactly match the data size allows us to remove
checks for over-sized grids, a non-portable
practice but one with a noticeable performance
effect. Data copies to and from the device were
the same as in EvalTreeBySample().
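
The node evaluation and path reduction steps mentioned
for the speculative function can be illustrated with a
serial host-side simulation. This is only a sketch under
stated assumptions, not the CUDA kernel or Procedure [5]
itself: the Node layout, the function names, and the use
of a `<=` split test are hypothetical, and the per-node
loop stands in for the threads of one record group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node layout (not the paper's encoding): an internal node
// tests record[attr] <= threshold; a leaf has left < 0 and carries a label.
struct Node {
    int attr;          // attribute index tested by an internal node
    float threshold;   // split value
    int left, right;   // child indices; left < 0 marks a leaf
    int label;         // class label, meaningful for leaves only
};

// Speculative evaluation of one record, simulated serially.
int eval_speculative(const std::vector<Node>& tree,
                     const std::vector<float>& record) {
    assert(!tree.empty());
    const std::size_t n = tree.size();
    std::vector<int> path(n), next(n);

    // Step 1: evaluate every node, whether or not the record reaches it.
    // Leaves point to themselves; internal nodes point to the chosen child.
    for (std::size_t i = 0; i < n; ++i) {
        const Node& nd = tree[i];
        if (nd.left < 0) path[i] = static_cast<int>(i);
        else path[i] = record[nd.attr] <= nd.threshold ? nd.left : nd.right;
    }

    // Step 2: pointer jumping. Each round doubles how far every entry has
    // advanced toward its leaf; ceil(log2(n)) rounds is a safe upper bound
    // because the depth is less than n, and leaves are fixed points.
    for (std::size_t span = 1; span < n; span *= 2) {
        for (std::size_t i = 0; i < n; ++i) next[i] = path[path[i]];
        path.swap(next);
    }
    return tree[path[0]].label;  // the root now points at the reached leaf
}

// Conventional serial walk from the root, used as the reference result.
int eval_serial(const std::vector<Node>& tree,
                const std::vector<float>& record) {
    int i = 0;
    while (tree[i].left >= 0) {
        const Node& nd = tree[i];
        i = record[nd.attr] <= nd.threshold ? nd.left : nd.right;
    }
    return tree[i].label;
}
```

Both functions return the same class for any record; the
speculative form trades extra node evaluations for a
fixed, data-independent amount of work per record, which
is the property that suits SIMD execution.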

After each CUDA function call, the returned 
buffer of class assignments was compared to the re- 
sults obtained using the serial algorithm, and any dis- 
crepancies were reported. None were found. 
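
The outer-time accumulation and the comparison against
the serial results might be sketched as follows. The
function names are hypothetical, and
std::chrono::steady_clock is used here as a portable
stand-in for the Windows high performance counter used in
the experiments.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Accumulated wall-clock time of `runs` calls to `fn`, in microseconds.
// The clock is started before and stopped after each individual call.
double accumulate_outer_time_us(const std::function<void()>& fn, int runs) {
    assert(runs > 0);
    auto total = std::chrono::steady_clock::duration::zero();
    for (int r = 0; r < runs; ++r) {
        auto start = std::chrono::steady_clock::now();
        fn();
        total += std::chrono::steady_clock::now() - start;
    }
    return std::chrono::duration<double, std::micro>(total).count();
}

// Number of records whose class assignment differs from the serial reference.
std::size_t count_discrepancies(const std::vector<int>& reference,
                                const std::vector<int>& candidate) {
    assert(reference.size() == candidate.size());
    std::size_t bad = 0;
    for (std::size_t i = 0; i < reference.size(); ++i)
        if (reference[i] != candidate[i]) ++bad;
    return bad;
}
```

An inner time would be collected the same way, with the
timed callable reduced to the kernel launch and a wait
for its completion.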

The entire program also ran with the CUDA pro- 
filer enabled. This facility captures device times- 
tamps and other metrics resulting from the program 
execution. 

4.3 Results 

The program output giving the outer and inner times
along with related statistics is summarized in Table [1].
Most notable is that the serial evaluation on the host
is twice as fast as the fastest parallel GPU version.
This is surprising but perhaps a bit misleading, since
no great pains were taken to optimize the memory
copy tasks, which were all done in series. Pinning and
aligning the host memory buffers and overlapping
copies with computation are viable techniques to boost
performance for this problem. However, it does point
out that the methods used by Sharp in [15] to measure
a speedup of two orders of magnitude may be
mismatched with our methods. Sharp also does not give
the serial algorithm used for comparison with the
parallel algorithm, suggesting that perhaps a branchless
serial algorithm performs better than that used in [15].

In our main result, comparing the inner times for
kernel execution, we see a roughly 25% performance
increase in EvalTreeByNode over EvalTreeBySample.
Further experiments on EvalTreeByNode showed that
inclusion of a conditional for checking an over-sized
warp increased runtime to roughly the same as
EvalTreeBySample. With m = 1, timings were again
roughly equal, showing that the expense of the initial
load of static paths and the processor-node map is
amortized over multiple record iterations. Values for
m > 32 (with related block resizing) showed no
significant benefit. This and other experiments suggest
that CUDA thread scheduling is as efficient as
iterating in a for loop.

Table 1: Outer and Inner Times According to High-Performance Counter

Examination of the CUDA profiler output shows
similar results for kernel timings (Figure [4]), though
uniformly lower than those measurable outside of the
CUDA driver. The GPU times confirm a ~25%
improvement in kernel times of 353.47 µs vs 485.17 µs.
The time in the graph for "memcpyHtoD" shows the
copy time of the data set and tree definitions (two
invocations per execution) for both CUDA functions
over 500 iterations each. Adding this and the
"memcpyDtoH" time to each of the kernel times gives the
outer time for each function, less time taken by the
host to allocate/free buffers and manage the function
calls.
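
As a quick check of the two kernel times quoted from the
profiler, and assuming the smaller one belongs to
EvalTreeByNode, the relative reduction in kernel time and
the corresponding speedup factor work out to

$$\frac{485.17 - 353.47}{485.17} \approx 0.27, \qquad \frac{485.17}{353.47} \approx 1.37$$

that is, a runtime reduction of about 27%, in line with
the roughly 25% figure reported for the inner times.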

The profiler data also shows EvalTreeByNode tak- 
ing an average of 4373 divergent branches across all 
threads due to the half-warp scheduling, whereas 
EvalTreeBySample shows 3530 across all threads, 
as each thread in a warp will iterate through the 
tree a different number of times. EvalTreeByNode 
had a global cache read hit rate of 70%, while 
EvalTreeBySample had a hit rate of only 31%. 

With fewer threads per block, EvalTreeBySample
encounters the limit on active blocks, leaving the
achieved occupancy rate at 66%. EvalTreeByNode
avoids this issue and achieves 100% occupancy. This
increases the number of global memory requests for
record data that can be active, and thus enhances
the effect of latency hiding by the warp scheduler.
This can be seen in the global memory write
throughput of 0.643 GB/s for EvalTreeBySample versus
4.68 GB/s for EvalTreeByNode. Read throughputs are
roughly equal at 14 GB/s (due to caching), giving
overall global memory throughputs of 15.43 GB/s for
EvalTreeBySample and 19.41 GB/s for EvalTreeByNode.
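
The occupancy figures are consistent with the resident
block and warp limits of compute capability 2.x
multiprocessors (8 blocks and 48 warps per SM). The
sketch below uses hypothetical names and ignores
register and shared memory limits, so it gives a
theoretical value rather than the achieved occupancy
reported by the profiler.

```cpp
#include <algorithm>
#include <cassert>

// Per-multiprocessor limits for compute capability 2.x devices.
constexpr int kWarpSize = 32;
constexpr int kMaxWarpsPerSM = 48;   // 1536 resident threads
constexpr int kMaxBlocksPerSM = 8;

// Theoretical occupancy: resident warps divided by the per-SM warp limit,
// considering only the block-count and warp-count limits.
double theoretical_occupancy(int threads_per_block) {
    assert(threads_per_block > 0 && threads_per_block <= 1024);
    int warps_per_block = (threads_per_block + kWarpSize - 1) / kWarpSize;
    int blocks = std::min(kMaxBlocksPerSM, kMaxWarpsPerSM / warps_per_block);
    return static_cast<double>(blocks * warps_per_block) / kMaxWarpsPerSM;
}
```

With 128 threads per block, 8 blocks of 4 warps give 32
of 48 warps, about 67%; with 256 threads per block (the
16 x 16 geometry), 6 blocks of 8 warps fill all 48.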



5. Conclusion 

We have shown a speculative decomposition algorithm
for parallel classification tree evaluation that
surpasses the performance of a data decomposition
parallel algorithm on the CUDA platform. When
ignoring the common, serial algorithm setup processing,
the speculative approach is 25% faster than the
data parallel approach in our particular problem
instance. This demonstrates how different parallel
decomposition techniques can maximize the advantages
of a given platform. In a SIMD environment, we
see that speculative decomposition into many
time-uniform tasks can have a helpful effect even at the
cost of less efficient hardware utilization. We also see
a good example of implementation results deviating
from asymptotic theoretical analysis. This is most
true when fundamental assumptions, such as
independent execution units, do not hold in the
implementation, as is the case here. Ultimately, the best
performance requires a careful balance of machine
and algorithm for a specific problem.

Additionally, we have seen that measurement
techniques which do not include the entire program
overhead of distributing data, or that compare different
algorithms, can lead to confusing results. Though
we have implemented a program very similar to [15],
our serial host implementation is roughly twice as
fast as the GPU versions when all overhead is included,
whereas Sharp reports a GPU implementation roughly
100 times faster than the serial one. Surely, some
difference in host speed, GPU power, and the lower
overhead cost of processing forests rather than single
trees is responsible for part of this discrepancy. The
remaining difference suggests that the branchless
evaluation algorithm ought to be used as the best
known serial algorithm for speedup comparisons.



6. Further Work 

The breadth of this result should be tested against
other tree geometries (e.g. more or less balanced,
deeper or more shallow) and record distributions
(ordered vs. random) to observe the effect different data
organizations can have on run times. Also, application
of these algorithms to more traditional SIMD,
i.e. vector, processors would be interesting.
Comparing CUDA compute 1.x devices with 2.x devices
might also provide additional insights.

Figure 4: Average timings taken by CUDA runtime over 500 executions (µs). Plot title: GPU Time Summary Plot.

To extend the current work, application to very 
large trees might be achieved by evaluating only a 
small "window" on the tree, starting at a root node 
and evaluating only the next few levels. Once re- 
duced, the resulting node would then become the root 
of the next window and the process repeated. This 
approach may be useful in overcoming SIMD con- 
currency limits (such as on a vectored processor) or 
the exponential growth of memory demand for deeper 
and deeper levels of the tree.